We are using Forest Fires Data Set UCI - Machine Learning Repository- This is a difficult regression task, where the aim is to predict the burned area of forest fires, in the northeast region of Portugal by using meteorological and other data.

Introduction

Before we get started, it is important for us to install the necessary R packages into R and launch these R packages into R environment. Then the codes chunks below uses glimpse() to display the data structure of will do the job.

packages = c('olsrr', 'corrplot', 'ggpubr', 
             'readxl', 'ggstatsplot',
             'funModeling', 'tidyverse')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}
## Loading required package: olsrr
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
## Loading required package: corrplot
## corrplot 0.84 loaded
## Loading required package: ggpubr
## Loading required package: ggplot2
## Loading required package: readxl
## Loading required package: ggstatsplot
## You can cite this package as:
##      Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
##      PsyArxiv. doi:10.31234/osf.io/p7mku
## Loading required package: funModeling
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## funModeling v.1.9.4 :)
## Examples and tutorials at livebook.datascienceheroes.com
##  / Now in Spanish: librovivodecienciadedatos.ai
## Loading required package: tidyverse
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## Found more than one class "atomicVector" in cache; using the first, from namespace 'Matrix'
## Also defined by 'Rmpfr'
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()    masks stats::filter()
## x dplyr::lag()       masks stats::lag()
## x dplyr::src()       masks Hmisc::src()
## x dplyr::summarize() masks Hmisc::summarize()
data_forestfires <- read.csv("/cloud/project/forestfires.csv",
                            header=TRUE,sep=",")

glimpse(data_forestfires)
## Rows: 517
## Columns: 13
## $ X     <int> 7, 7, 7, 8, 8, 8, 8, 8, 8, 7, 7, 7, 6, 6, 6, 6, 5, 8, 6, 6, 6, 5…
## $ Y     <int> 5, 4, 4, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4…
## $ month <chr> "mar", "oct", "oct", "mar", "mar", "aug", "aug", "aug", "sep", "…
## $ day   <chr> "fri", "tue", "sat", "fri", "sun", "sun", "mon", "mon", "tue", "…
## $ FFMC  <dbl> 86.2, 90.6, 90.6, 91.7, 89.3, 92.3, 92.3, 91.5, 91.0, 92.5, 92.5…
## $ DMC   <dbl> 26.2, 35.4, 43.7, 33.3, 51.3, 85.3, 88.9, 145.4, 129.5, 88.0, 88…
## $ DC    <dbl> 94.3, 669.1, 686.9, 77.5, 102.2, 488.0, 495.6, 608.2, 692.6, 698…
## $ ISI   <dbl> 5.1, 6.7, 6.7, 9.0, 9.6, 14.7, 8.5, 10.7, 7.0, 7.1, 7.1, 22.6, 0…
## $ temp  <dbl> 8.2, 18.0, 14.6, 8.3, 11.4, 22.2, 24.1, 8.0, 13.1, 22.8, 17.8, 1…
## $ RH    <int> 51, 33, 33, 97, 99, 29, 27, 86, 63, 40, 51, 38, 72, 42, 21, 44, …
## $ wind  <dbl> 6.7, 0.9, 1.3, 4.0, 1.8, 5.4, 3.1, 2.2, 5.4, 4.0, 7.2, 4.0, 6.7,…
## $ rain  <dbl> 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
## $ area  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Regresion Basic lm()

An ordinary least squares regression can be done with the command ols_regress, where the parameters that need to be entered are as follows * output model (outvar ~ var + var2 + ... + varn) * data (data = matrix_var) E.g.

First, we will build a simple linear regression model by using Price as the dependent variable and wind as the independent variable.

lm() returns an object of class “lm” or for multiple responses of class c(“mlm”, “lm”).

The functions summary() and anova() can be used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.

data_forestfires.slr <- lm(formula = temp~wind, data = data_forestfires)

summary(data_forestfires.slr)
## 
## Call:
## lm(formula = temp ~ wind, data = data_forestfires)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.5839  -3.2590   0.1979   3.7785  14.1979 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21.8464     0.6117  35.713  < 2e-16 ***
## wind         -0.7361     0.1391  -5.292 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.66 on 515 degrees of freedom
## Multiple R-squared:  0.05158,    Adjusted R-squared:  0.04974 
## F-statistic: 28.01 on 1 and 515 DF,  p-value: 1.79e-07

The output report reveals that the Fire Probability can be explained by using the formula:

  *y = 21.8464 + -0.7361x1*
  

The R-squared of 0.05158 reveals that the simple regression model built is able to explain about 5% of the windy.

Since p-value is much smaller than 0.0001, we will reject the null hypothesis that mean is a good estimator of trade-in prices. This will allow us to infer that simple linear regression model above is a good estimator of Fire Probability.

The Coefficients: section of the report reveals that the p-values of both the estimates of the Intercept and wind are smaller than 0.001. In view of this, the null hypothesis of the B0 and B1 are equal to 0 will be rejected. As a results, we will be able to infer that the B0 and B1 are good parameter estimates.

To visualise the best fit curve on a scatterplot, we can incorporate lm() as a method function in ggplot’s geometry as shown in the code chunk below.
## Initial Model lm()

data_forestfires.mlr <- lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)

summary(data_forestfires.mlr)
## 
## Call:
## lm(formula = temp ~ wind + FFMC + DMC + DC + ISI + RH + rain + 
##     area, data = data_forestfires)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5990  -2.2008   0.0206   2.0001  11.2634 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.4317692  3.3349617   7.026 6.86e-12 ***
## wind        -0.4825665  0.0916281  -5.267 2.06e-07 ***
## FFMC        -0.0277699  0.0371820  -0.747  0.45549    
## DMC          0.0262144  0.0035803   7.322 9.65e-13 ***
## DC           0.0047590  0.0008921   5.335 1.44e-07 ***
## ISI          0.2734885  0.0417666   6.548 1.43e-10 ***
## RH          -0.1837996  0.0105480 -17.425  < 2e-16 ***
## rain         1.7339747  0.5393131   3.215  0.00139 ** 
## area         0.0026876  0.0024950   1.077  0.28190    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.583 on 508 degrees of freedom
## Multiple R-squared:  0.6252, Adjusted R-squared:  0.6193 
## F-statistic: 105.9 on 8 and 508 DF,  p-value: < 2.2e-16

The R-squared of 0.6252 reveals that the simple regression model built is able to explain about 62% of the trade-in prices.

Residual vs Fitted Values Plot

Used to detect non-linearity, unequal error variances and outliers.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_fit(model)

DFBETAs Panel

DFBETA -> Degrees of Freedom: measures the difference in each parameter estimate with and without the influential observation.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_dfbetas(model)

Residual Fit Spread Plot

Used to detect non-linearity, influential observations and outliers.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_fit_spread(model)

Breusch Pagan Test

Used to test for heteroscedasticity. The variance of the errors from the regression are tested for dependence from the independent variables with a chi^2 test.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_breusch_pagan(model)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data               
##  --------------------------------
##  Response : temp 
##  Variables: fitted values of temp 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    4.594941 
##  Prob > Chi2   =    0.03206645

Collinearity Diagnostics

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_coll_diag(model)
## Tolerance and Variance Inflation Factor
## ---------------------------------------
##   Variables Tolerance      VIF
## 1      wind 0.9230740 1.083337
## 2      FFMC 0.5905279 1.693400
## 3       DMC 0.4731140 2.113655
## 4        DC 0.5080208 1.968423
## 5       ISI 0.6859830 1.457762
## 6        RH 0.8397653 1.190809
## 7      rain 0.9764642 1.024103
## 8      area 0.9862348 1.013957
## 
## 
## Eigenvalue and Condition Index
## ------------------------------
##    Eigenvalue Condition Index    intercept         wind         FFMC
## 1 6.334222555        1.000000 5.381337e-05 3.217684e-03 5.280743e-05
## 2 0.996429442        2.521293 1.135646e-06 9.329011e-07 8.962059e-07
## 3 0.948920997        2.583638 5.818602e-06 2.469398e-04 4.994505e-06
## 4 0.301757141        4.581608 2.087390e-04 1.579064e-01 1.097961e-04
## 5 0.181105029        5.914001 1.193019e-04 1.694758e-02 1.159695e-05
## 6 0.110785950        7.561437 4.511868e-04 6.715867e-01 3.723563e-04
## 7 0.080618110        8.864012 2.758447e-03 1.817315e-02 2.641638e-03
## 8 0.045047543       11.857991 9.530865e-03 1.257576e-01 9.980588e-03
## 9 0.001113231       75.431713 9.868707e-01 6.162940e-03 9.868253e-01
##            DMC           DC          ISI           RH         rain         area
## 1 2.502828e-03 1.918475e-03 2.982489e-03 2.200411e-03 2.939567e-04 0.0013208901
## 2 7.550492e-07 2.144028e-05 8.361061e-06 5.231079e-05 8.619953e-01 0.1083149354
## 3 9.026652e-07 6.087723e-05 2.355100e-04 4.783405e-04 1.149303e-01 0.8734670753
## 4 1.410558e-01 5.941679e-02 2.859691e-03 2.302289e-02 1.498850e-04 0.0043570585
## 5 1.019015e-02 1.079331e-02 5.038357e-01 1.527821e-01 2.743713e-05 0.0003127117
## 6 1.175409e-01 1.157872e-02 1.715947e-01 1.896446e-01 1.030713e-04 0.0071085201
## 7 3.971155e-01 3.263182e-01 6.016093e-02 2.252214e-01 1.385772e-02 0.0008918784
## 8 2.852202e-01 5.897914e-01 9.035461e-02 2.621895e-01 5.064681e-03 0.0041208058
## 9 4.637301e-02 1.008694e-04 1.679680e-01 1.444084e-01 3.577598e-03 0.0001061247

Stepwise regression

An iterative method in which features are entered and removed one by one and decided upon by performing f or t tests to know if they were significant.

model = lm(y ~ ., data = surgical)
k = ols_step_both_p(model)
k
## 
##                                 Stepwise Selection Summary                                 
## ------------------------------------------------------------------------------------------
##                         Added/                   Adj.                                         
## Step     Variable      Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ------------------------------------------------------------------------------------------
##    1    liver_test     addition       0.455       0.444    62.5120    771.8753    296.2992    
##    2     alc_heavy     addition       0.567       0.550    41.3680    761.4394    266.6484    
##    3    enzyme_test    addition       0.659       0.639    24.3380    750.5089    238.9145    
##    4      pindex       addition       0.750       0.730     7.5370    735.7146    206.5835    
##    5        bcs        addition       0.781       0.758     3.1920    730.6204    195.4544    
## ------------------------------------------------------------------------------------------
plot(k)

Stepwise AIC Backward Regression

An iterative method in which features are removed one by one and decided upon by checking the Akaike Information Criterion (less is better).

model = lm(y ~ ., data =surgical)
k = ols_step_backward_aic(model)
k
## 
## 
##                         Backward Elimination Summary                         
## ---------------------------------------------------------------------------
## Variable        AIC          RSS          Sum Sq        R-Sq      Adj. R-Sq 
## ---------------------------------------------------------------------------
## Full Model    736.390    1825905.713    6543614.824    0.78184      0.74305 
## alc_mod       734.407    1826477.828    6543042.709    0.78177      0.74856 
## gender        732.494    1829435.617    6540084.920    0.78142      0.75351 
## age           730.620    1833716.447    6535804.090    0.78091      0.75808 
## ---------------------------------------------------------------------------
plot(k)

References

https://olsrr.rsquaredacademy.com/articles/intro.html

One of the assumptions for an ordinary least squares regression is that the errors are constant across the domain of the model (variance does not vary with input). This is known as homoscedasticity, and its counterpart is heteroscedasticity.

As homoscedasticity is an assumption of the formulation of OLS, if the model has heteroscedasticity the model’s hypothesis test are no longer valid, and the estimators are not BLUE (Best Linear Unbiased Estimator).

Bartlett Test

Used to test if variances across samples is equal. Its null hypothesis is that the variances are equivalent. \[ H_0: \sigma_1^2 = \sigma_2^2 = ... = \sigma_k^2 \] Its alternative hypothesis is that at least one pair of variances are different. \[ H_a: \sigma_i^2 \neq \sigma_j^2 \ \ \ \text{for at least one pair} (i,j) \]

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_breusch_pagan(model)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data               
##  --------------------------------
##  Response : temp 
##  Variables: fitted values of temp 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    4.594941 
##  Prob > Chi2   =    0.03206645

Using independent variables

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_breusch_pagan(model, rhs = TRUE)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                     Data                     
##  --------------------------------------------
##  Response : temp 
##  Variables: wind FFMC DMC DC ISI RH rain area 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    8 
##  Chi2          =    22.89758 
##  Prob > Chi2   =    0.003498264

Using independent variables and performing multiple tests

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                     Data                     
##  --------------------------------------------
##  Response : temp 
##  Variables: wind FFMC DMC DC ISI RH rain area 
## 
##            Test Summary (Unadjusted p values)          
##  ----------------------------------------------------
##   Variable             chi2        df         p       
##  ----------------------------------------------------
##   wind             11.605954664     1    0.0006574099 
##   FFMC              8.358981197     1    0.0038378568 
##   DMC               6.787927955     1    0.0091776372 
##   DC                3.261272477     1    0.0709340620 
##   ISI               1.301192780     1    0.2539954635 
##   RH                0.063113574     1    0.8016408127 
##   rain              0.109431754     1    0.7407920054 
##   area              0.001040708     1    0.9742647076 
##  ----------------------------------------------------
##   simultaneous     22.897577316     8    0.0034982642 
##  ----------------------------------------------------

The p-value can be adjusted with the p.adj parameter. Bonferroni, Sidak and Holm’s p-value adjustment are available.

Score Test

Test for heteroscedasticity under the assumption that the errors are iid.

Using fitted values

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_score(model)
## 
##  Score Test for Heteroskedasticity
##  ---------------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of temp 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.2252 
##  Prob > Chi2   =    0.07251294

Using independent variables

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_score(model, rhs = TRUE)
## 
##  Score Test for Heteroskedasticity
##  ---------------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: wind FFMC DMC DC ISI RH rain area 
## 
##        Test Summary         
##  ---------------------------
##  DF            =    8 
##  Chi2          =    16.07186 
##  Prob > Chi2   =    0.041363

Specify variables

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_score(model, vars = c('DMC', 'DC'))
## 
##  Score Test for Heteroskedasticity
##  ---------------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: DMC DC 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    2 
##  Chi2          =    4.765532 
##  Prob > Chi2   =    0.09229494

F Test

F test for heteroscedasticity under the assumption that the errors are iid.

Using fitted values

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_f(model)
## 
##  F Test for Heteroskedasticity
##  -----------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of temp 
## 
##        Test Summary        
##  --------------------------
##  Num DF     =    1 
##  Den DF     =    515 
##  F          =    3.232891 
##  Prob > F   =    0.07275873

Using independent variables

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_f(model, rhs = TRUE)
## 
##  F Test for Heteroskedasticity
##  -----------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: wind FFMC DMC DC ISI RH rain area 
## 
##        Test Summary        
##  --------------------------
##  Num DF     =    8 
##  Den DF     =    508 
##  F          =    2.037345 
##  Prob > F   =    0.04045944

Specify variables

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_f(model, vars = c('DMC', 'DC'))
## 
##  F Test for Heteroskedasticity
##  -----------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: DMC DC 
## 
##        Test Summary        
##  --------------------------
##  Num DF     =    2 
##  Den DF     =    514 
##  F          =    2.390979 
##  Prob > F   =    0.09255751

All Possible Regression

Variable Selection Methods

For K potential features there are \(2^k\) distinct possible regression models to be tested.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
k = ols_step_all_possible(model)
k

plot can be used to show the panel of fit criteria for all possible regressions.

plot(k)

Best Subset Regression

Selects the best subset that perform the best at some objective criteria, such as \(R^2\), MS2, AIC, etc.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
k = ols_step_best_subset(model)
k
plot(k)

Stepwise Forward Regression

Buidls a regression model from a set of features by entering them one by one based on p values. Further details can be obtained by setting details = true.

model = lm(temp ~ ., data = data_forestfires)
k = ols_step_forward_p(model)
k
## 
##                              Selection Summary                              
## ---------------------------------------------------------------------------
##         Variable                  Adj.                                         
## Step    Entered     R-Square    R-Square      C(p)         AIC        RMSE     
## ---------------------------------------------------------------------------
##    1    month         0.4695      0.4579    666.2349    2983.2447    4.2751    
##    2    RH            0.7530      0.7471     38.1200    2590.0994    2.9202    
##    3    rain          0.7611      0.7549     22.0959    2574.8449    2.8747    
##    4    DMC           0.7668      0.7603     11.4318    2564.3676    2.8431    
##    5    day           0.7740      0.7649     -2.7183    2560.0058    2.8153    
##    6    wind          0.7761      0.7667     -5.4049    2557.1593    2.8050    
##    7    Y             0.7777      0.7678     -6.8441    2555.5736    2.7981    
##    8    ISI           0.7791      0.7688     -7.9609    2554.3024    2.7921    
##    9    area          0.7798      0.7690     -7.4379    2554.7449    2.7907    
## ---------------------------------------------------------------------------
plot(k)

Stepwise Backward Regression

Builds a regression model starting with all features and removing them one by one based on p values. Further details can be obtained by setting details = true.

model = lm(temp ~ ., data = data_forestfires)
k = ols_step_backward_p(model)
k
## 
## 
##                            Elimination Summary                             
## --------------------------------------------------------------------------
##         Variable                  Adj.                                        
## Step    Removed     R-Square    R-Square     C(p)         AIC        RMSE     
## --------------------------------------------------------------------------
##    1    X              0.780      0.7683    -3.9758    2558.1766    2.7949    
##    2    DC            0.7799      0.7687    -5.8057    2556.3563    2.7925    
##    3    FFMC          0.7798       0.769    -7.4379    2554.7449    2.7907    
## --------------------------------------------------------------------------
plot(k)

Stepwise regression

An iterative method in which features are entered and removed one by one and decided upon by performing f or t tests to know if they were significant.

model = lm(temp ~ ., data = data_forestfires)
k = ols_step_both_p(model)
k
## 
##                               Stepwise Selection Summary                                
## ---------------------------------------------------------------------------------------
##                      Added/                   Adj.                                         
## Step    Variable    Removed     R-Square    R-Square      C(p)         AIC        RMSE     
## ---------------------------------------------------------------------------------------
##    1     month      addition       0.469       0.458    666.2350    2983.2447    4.2751    
##    2       RH       addition       0.753       0.747     38.1200    2590.0994    2.9202    
##    3      rain      addition       0.761       0.755     22.0960    2574.8449    2.8747    
##    4      DMC       addition       0.767       0.760     11.4320    2564.3676    2.8431    
##    5      day       addition       0.774       0.765     -2.7180    2560.0058    2.8153    
##    6      wind      addition       0.776       0.767     -5.4050    2557.1593    2.8050    
##    7       Y        addition       0.778       0.768     -6.8440    2555.5736    2.7981    
##    8      ISI       addition       0.779       0.769     -7.9610    2554.3024    2.7921    
## ---------------------------------------------------------------------------------------
plot(k)

Stepwise AIC Forward Regression

An iterative method in which features are added one by one and decided upon by checking the Akaike Information Criterion (less is better).

model = lm(temp ~ ., data = data_forestfires)
k = ols_step_forward_aic(model)
k
## 
##                           Selection Summary                            
## ----------------------------------------------------------------------
## Variable       AIC        Sum Sq        RSS        R-Sq      Adj. R-Sq 
## ----------------------------------------------------------------------
## month        2983.245     8168.284    9229.636    0.46950      0.45794 
## RH           2590.099    13100.067    4297.852    0.75297      0.74709 
## rain         2574.845    13241.138    4156.781    0.76108      0.75490 
## DMC          2564.368    13340.257    4057.662    0.76677      0.76027 
## day          2560.006    13466.661    3931.258    0.77404      0.76493 
## wind         2557.159    13503.342    3894.578    0.77615      0.76665 
## Y            2555.574    13530.260    3867.659    0.77769      0.76779 
## ISI          2554.302    13554.654    3843.265    0.77910      0.76879 
## ----------------------------------------------------------------------
plot(k)

Stepwise AIC Backward Regression

An iterative method in which features are removed one by one and decided upon by checking the Akaike Information Criterion (less is better).

model = lm(temp ~ ., data = data_forestfires)
k = ols_step_backward_aic(model)
k
## 
## 
##                       Backward Elimination Summary                       
## -----------------------------------------------------------------------
## Variable        AIC         RSS        Sum Sq       R-Sq      Adj. R-Sq 
## -----------------------------------------------------------------------
## Full Model    2560.151    3827.305    13570.614    0.78001      0.76787 
## X             2558.177    3827.495    13570.424    0.78000      0.76833 
## DC            2556.356    3828.826    13569.093    0.77993      0.76872 
## FFMC          2554.745    3831.705    13566.215    0.77976      0.76902 
## area          2554.302    3843.265    13554.654    0.77910      0.76879 
## -----------------------------------------------------------------------
plot(k)

Stepwise AIC Regression

An iterative method in which features are added and removed one by one, and decided upon by checking the Akaike Information Criterion (less is better).

model = lm(temp ~ ., data = data_forestfires)
k = ols_step_both_aic(model)
k
## 
## 
##                                  Stepwise Summary                                  
## ---------------------------------------------------------------------------------
## Variable     Method       AIC         RSS        Sum Sq       R-Sq      Adj. R-Sq 
## ---------------------------------------------------------------------------------
## month       addition    2983.245    9229.636     8168.284    0.46950      0.45794 
## RH          addition    2590.099    4297.852    13100.067    0.75297      0.74709 
## rain        addition    2574.845    4156.781    13241.138    0.76108      0.75490 
## DMC         addition    2564.368    4057.662    13340.257    0.76677      0.76027 
## day         addition    2560.006    3931.258    13466.661    0.77404      0.76493 
## wind        addition    2557.159    3894.578    13503.342    0.77615      0.76665 
## Y           addition    2555.574    3867.659    13530.260    0.77769      0.76779 
## ISI         addition    2554.302    3843.265    13554.654    0.77910      0.76879 
## ---------------------------------------------------------------------------------
plot(k)

Measures of Influence

It is possible for a single observation to have a great influence on the results of a regression. It is, therefore, important to detect influential observations to take them into consideration when interpreting results.

Cook’s D Bar Plot

Cook’s distance was introduced in 1977 by American statician R Dennis Cook. It depends on both the residual and leverage. The steps to compute it are: * delete observations one at a time * refit the regression model on the remaining observations * examine how much the fitted values changed after the observation was deleted

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_cooksd_bar(model)

Cook’s D Chart

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_cooksd_chart(model)

DFBETAs Panel

(DF -> Degrees of Freedom)

DFBETA measures the difference in each parameter estimate with and without the influential point. There is a DFBETA for each data point.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_dfbetas(model)

DFFITS Plot

Proposed by Welsch and Kuh (1977). It is the difference between the \(i^{th}\) fitted value obtained from the full data and the \(i^{th}\) observation. Used to identify influential data points.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_dffits(model)

Studentized Residual Plot

Studentized deleted residuals is the deleted residual divided by its estimated standard deviation. If an observation has a studentized deleted residual larger than 3 (absolute value) we call it an outlier.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_stud(model)

Standardized Residual Chart

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_stand(model)

Studentized Residuals vs Leverage Plot

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_lev(model)

Deleted Studentized Residual vs Fitted Values Plot

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_stud_fit(model)

Hadi Plot

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_hadi(model)

Potential Residual Plot

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_pot(model)

Collinearity Diagnostics

Collinearity implies two variables are near perfect linear combinations of one another. Multicollinearity involves more than two variables. In the presence of multicollinearity, regression estimates are unstable and have high standard errors.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_coll_diag(model)
## Tolerance and Variance Inflation Factor
## ---------------------------------------
##   Variables Tolerance      VIF
## 1      wind 0.9230740 1.083337
## 2      FFMC 0.5905279 1.693400
## 3       DMC 0.4731140 2.113655
## 4        DC 0.5080208 1.968423
## 5       ISI 0.6859830 1.457762
## 6        RH 0.8397653 1.190809
## 7      rain 0.9764642 1.024103
## 8      area 0.9862348 1.013957
## 
## 
## Eigenvalue and Condition Index
## ------------------------------
##    Eigenvalue Condition Index    intercept         wind         FFMC
## 1 6.334222555        1.000000 5.381337e-05 3.217684e-03 5.280743e-05
## 2 0.996429442        2.521293 1.135646e-06 9.329011e-07 8.962059e-07
## 3 0.948920997        2.583638 5.818602e-06 2.469398e-04 4.994505e-06
## 4 0.301757141        4.581608 2.087390e-04 1.579064e-01 1.097961e-04
## 5 0.181105029        5.914001 1.193019e-04 1.694758e-02 1.159695e-05
## 6 0.110785950        7.561437 4.511868e-04 6.715867e-01 3.723563e-04
## 7 0.080618110        8.864012 2.758447e-03 1.817315e-02 2.641638e-03
## 8 0.045047543       11.857991 9.530865e-03 1.257576e-01 9.980588e-03
## 9 0.001113231       75.431713 9.868707e-01 6.162940e-03 9.868253e-01
##            DMC           DC          ISI           RH         rain         area
## 1 2.502828e-03 1.918475e-03 2.982489e-03 2.200411e-03 2.939567e-04 0.0013208901
## 2 7.550492e-07 2.144028e-05 8.361061e-06 5.231079e-05 8.619953e-01 0.1083149354
## 3 9.026652e-07 6.087723e-05 2.355100e-04 4.783405e-04 1.149303e-01 0.8734670753
## 4 1.410558e-01 5.941679e-02 2.859691e-03 2.302289e-02 1.498850e-04 0.0043570585
## 5 1.019015e-02 1.079331e-02 5.038357e-01 1.527821e-01 2.743713e-05 0.0003127117
## 6 1.175409e-01 1.157872e-02 1.715947e-01 1.896446e-01 1.030713e-04 0.0071085201
## 7 3.971155e-01 3.263182e-01 6.016093e-02 2.252214e-01 1.385772e-02 0.0008918784
## 8 2.852202e-01 5.897914e-01 9.035461e-02 2.621895e-01 5.064681e-03 0.0041208058
## 9 4.637301e-02 1.008694e-04 1.679680e-01 1.444084e-01 3.577598e-03 0.0001061247

Variance Inflation Factors

They measure the inflation in the variances of the parameter estimates due to collinearities that exist among the featuers. It is a measure of how much the variance of the regression coefficient \(\beta_k\) is inflated by the existence of correlation among the features in the model.

A VIF of 1 means that there is no correlation. The general rule of thumb is that VIFs exceeding 4 warrant further investigation. The steps to calculate VIF are: * Regress the \(k^{th}\) feature on rest of the features in the model. * Compute \(R_k^2\)

\[ VIF = \frac{1}{1-R_k^2} = \frac{1}{Tolerance}\]

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_vif_tol(model)

Condition Index

Most multivariate statistical approaches involve decomposing a correlation matrix into linear combinations of variables. The linear combinations are chosen so that the first combination has the largest possible variance (subject to some restrictions we won’t discuss), the second combination has the next largest variance, subject to being uncorrelated with the first, the third has the largest possible variance, subject to being uncorrelated with the first and second, and so forth. The variance of each of these linear combinations is called an eigenvalue. Collinearity is spotted by finding 2 or more variables that have large proportions of variance (.50 or more) that correspond to large condition indices. A rule of thumb is to label as large those condition indices in the range of 30 or larger.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_eigen_cindex(model)

Model Fit Assessment

Residual Fit Spread Plot

Plot to detect non-linearity, influential observations and outliers. Consists of side-by-side quantile plots of the centered fit and the residuals.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_fit_spread(model)

Part & Partial Correlations

Correlations

How much each feature uniquely contributes to \(R^2\).

Zero order

Pearson correlation coefficient between the dependent variable and the features.

Part

Unique contribution of feature. How much the \(R^2\) would decrease if the feature was removed from the model.

Partial

How much of the output variance is estimaited by the feature.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_correlations(model)

Observed vs Predicted Plot

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_obs_fit(model)

Lack of Fit F Test

The residual sum of squares can be decomposed into 2 components: * Due to lack of fit * Due to random variation

If most of the error is due to lack of fit and not just random error, the model should be discarded and a new model must be built. The test assesses how much of the error in the prediction is due to the lack of model fit.

data_forestfires.mlr <- lm(formula = temp~wind, data = data_forestfires)
ols_pure_error_anova(data_forestfires.mlr)
## Lack of Fit F Test 
## -----------------
## Response :   temp 
## Predictor:   wind 
## 
##                       Analysis of Variance Table                        
## -----------------------------------------------------------------------
##                 DF      Sum Sq     Mean Sq     F Value        Pr(>F)    
## -----------------------------------------------------------------------
## wind              1    897.4156    897.4156    30.34966    5.706638e-08 
## Residual        515    16500.50    32.03981                             
##  Lack of fit     19    1834.172    96.53539    3.264726    4.752099e-06 
##  Pure Error     496    14666.33    29.56922                             
## -----------------------------------------------------------------------

Diagnostics Panel

lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
## 
## Call:
## lm(formula = temp ~ wind + FFMC + DMC + DC + ISI + RH + rain + 
##     area, data = data_forestfires)
## 
## Coefficients:
## (Intercept)         wind         FFMC          DMC           DC          ISI  
##   23.431769    -0.482566    -0.027770     0.026214     0.004759     0.273489  
##          RH         rain         area  
##   -0.183800     1.733975     0.002688
ols_plot_diagnostics(model)

Variable Contributions

Residual vs Regressor Plots

Graph to determine whether we should add a new predictor to the model already containing other predictors. The residuals from the model is regressed on the new predictor and if the plot shows non random pattern, you should consider adding the new predictor to the model.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_regressor(model, 'temp')

Added Variable Plot

Provides information about the marginal importance of a feature X, given the other features already in the model.

Steps to construct an added variable plot: * Regress Y on all features other than X and store the residuals. * Regress X on all the other features in the model. * Construct a scatter plot with the residuals of the previous steps.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_added_variable(model)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Residual Plus Component Plot

Steps to construct the plot: * Regress Y on all features including X. * Multiply \(e\) with regression coefficient of X. * Construct plot of \(eX\) and \(X\).

The residual plus component plot indicates whether any non-linearity is present in the relationship between Y and X and can suggest possible transformations for linearizing the data.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_comp_plus_resid(model)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

References

https://olsrr.rsquaredacademy.com/articles/regression_diagnostics.html

Residual QQ Plot

Used to violation of normality assumption.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_qq(model)

Residual Normality Test

Tests for normality by means of multiple statistical tests.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_normality(model)
## Warning in ks.test(y, "pnorm", mean(y), sd(y)): ties should not be present for
## the Kolmogorov-Smirnov test
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9887          5e-04 
## Kolmogorov-Smirnov        0.0458         0.2292 
## Cramer-von Mises         33.8991         0.0000 
## Anderson-Darling          1.5921          4e-04 
## -----------------------------------------------

Correlation between observed residuals and expected residuals under normality.

model = lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_test_correlation(model)
## [1] 0.9942176

Residual vs Fitted Values Plot

A scatter plot of residuals vs fitted values to detect non-linearity, unequal error variances and outliers.

model <- lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_fit(model)

A good residual vs fitted plot has: * Residuals randomly spread around the 0 line. * Residuals forming a band around the 0 (like a pipe) indicating homogeneity of error variance. * No residual visibly away from the random pattern.

Residual Histogram

Used to detect violation of normality assumption.

model <- lm(formula = temp~wind+FFMC+DMC+DC+ISI+RH+rain+area, data = data_forestfires)
ols_plot_resid_hist(model)

References

https://olsrr.rsquaredacademy.com/articles/residual_diagnostics.html

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.